A significant share of hotel bookings end in cancellations or no-shows. Typical reasons include changes of plans, scheduling conflicts, and the like. Cancelling is often free of charge or available at low cost, which benefits guests but is undesirable and potentially revenue-diminishing for hotels. Losses are particularly high for last-minute cancellations.
Online booking channels have dramatically changed customers' booking possibilities and behavior. This adds a further dimension to the challenge of handling cancellations, which is no longer limited to traditional booking and guest characteristics.
Booking cancellations impact a hotel on various fronts, including revenue, operational efficiency, and customer satisfaction.
The increasing number of cancellations calls for a Machine Learning based solution that can predict which bookings are likely to be canceled. INN Hotels Group, a chain of hotels in Portugal, is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable cancellation and refund policies.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
In today's hospitality industry, the prevalence of booking cancellations poses significant challenges for hotels, impacting revenue, operational efficiency, and customer satisfaction. INN Hotels Group, a prominent chain of hotels in Portugal, is grappling with the detrimental effects of high cancellation rates.
The primary objective is to develop a Machine Learning (ML) solution capable of accurately predicting booking cancellations in advance. This predictive model will empower INN Hotels Group to anticipate and proactively address potential cancellations, thereby minimizing revenue loss, optimizing resource allocation, and enhancing overall operational efficiency. It will also allow INN Hotels Group to institute new, profitable policies on cancellations and refunds.
# Installing the libraries with the specified version.
#!pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 statsmodels==0.14.1 -q --user
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
precision_recall_curve,
roc_curve,
make_scorer,
)
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# removing the limit for the number of displayed columns
pd.set_option("display.max_columns", None) # To set column limits replace None with a number
# setting the limit for the number of displayed rows
pd.set_option("display.max_rows", None) # To set row limits replace None with a number
# setting the display precision of floating-point numbers to 6 decimal places
pd.set_option("display.float_format", lambda x: "%.6f" % x)
# Provided by GreatLearning
# function to create a combined histogram and boxplot that share the same x-axis
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (15,10))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# Provided by GreatLearning
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 2, 6))
else:
plt.figure(figsize=(n + 2, 6))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
order=data[feature].value_counts().index[:n],
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# Provided by GreatLearning
# function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0])
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
)
plt.tight_layout()
plt.show()
# Provided by GreatLearning
# Display a stacked barplot
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
plt.legend(loc="upper left", frameon=False, bbox_to_anchor=(1, 1))
plt.show()
# Purpose: Create Boxplot for multiple variables (x being a categorical value)
#
# Inputs:
#
# in_data: DataFrame object containing rows and columns of data
# x_feature: str representing the column name for the x-axis (categorical data)
# y_feature: str representing the column name for the y-axis
#
def multi_boxplot(in_data, x_feature, y_feature):
# Only proceed if each feature is a single column-name string and the data is a DataFrame
if isinstance(in_data, pd.DataFrame) and isinstance(x_feature, str) and isinstance(y_feature, str):
# visualizing the relationship between the two features
plt.figure(figsize=(12, 5))
sns.boxplot(data=in_data, x=x_feature, y=y_feature, showmeans=True)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xticks(rotation='vertical')
plt.xlabel(x_feature, fontsize=15)
plt.ylabel(y_feature, fontsize=15);
plt.show()
#Outlier detection
def outlier_detection(data):
"""
Display a grid of box plots for each numeric feature; while showing the outlier data
data: dataframe
"""
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# dropping booking_status
numeric_columns.remove("booking_status")
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i + 1)
plt.boxplot(data[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# Purpose: To treat outliers by clipping them to the lower and upper whisker
#
# Inputs:
# df: Dataframe
# col: Feature that has outliers to treat
#
# Note: This procedure is being utilized from GreatLearning; Week 4 (Hands_on_Notebook_ExploratoryDataAnalysis)
def treat_outliers(df, col):
"""
treats outliers in a variable
col: str, name of the numerical variable
df: dataframe
col: name of the column
"""
Q1 = df[col].quantile(0.25) # 25th quantile
Q3 = df[col].quantile(0.75) # 75th quantile
IQR = Q3 - Q1 # Inter Quantile Range (75th perentile - 25th percentile)
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
# all the values smaller than lower_whisker will be assigned the value of lower_whisker
# all the values greater than upper_whisker will be assigned the value of upper_whisker
# the assignment will be done by using the clip function of NumPy
df[col] = np.clip(df[col], lower_whisker, upper_whisker)
return df
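As a quick sanity check, the clipping logic above can be exercised on a hypothetical toy column (the values and column name below are illustrative only, not from the hotel dataset):

```python
import numpy as np
import pandas as pd

def treat_outliers(df, col):
    # clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], same logic as above
    Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    IQR = Q3 - Q1
    df[col] = np.clip(df[col], Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
    return df

# toy data: Q1 = 2, Q3 = 4, so the upper whisker is 4 + 1.5*2 = 7
toy = pd.DataFrame({"lead_time": [1, 2, 3, 4, 100]})
toy = treat_outliers(toy, "lead_time")
print(toy["lead_time"].tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0] - the 100 is clipped to 7
```

Note that clipping preserves the row count, unlike dropping outliers, which matters here because every booking carries a label we want to keep for training.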
# Provided by GreatLearning
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# flagging which predicted probabilities are greater than the threshold
pred_temp = model.predict(predictors) > threshold
# converting the boolean flags into 0/1 class labels
pred = np.round(pred_temp)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
# Provided by GreatLearning
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
# Provided by GreatLearning
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# Provided by GreatLearning
# we will define a function to check VIF
def checking_vif(predictors):
vif = pd.DataFrame()
vif["feature"] = predictors.columns
# calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(predictors.values, i)
for i in range(len(predictors.columns))
]
return vif
# Provided by GreatLearning
# defining a function to plot the precision vs recall vs threshold
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b--", label="precision")
plt.plot(thresholds, recalls[:-1], "g--", label="recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
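The `[:-1]` slicing above exists because sklearn's precision_recall_curve returns one more precision/recall value than thresholds. A small sketch with hypothetical scores shows the shapes involved:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# hypothetical true labels and predicted probabilities for six observations
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)
# precisions and recalls always have one more entry than thresholds,
# so plot_prec_recall_vs_tresh drops the last entry of each before plotting
print(len(precisions), len(recalls), len(thresholds))
```

By convention the final entries are precision 1.0 and recall 0.0, which correspond to predicting no positives at all and have no associated threshold.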
#Import the data set
original_data = pd.read_csv("./INNHotelsGroup.csv")
#Make a copy of the data
data = original_data.copy()
# Verify the data file was read correctly by displaying the first five rows.
data.head(5)
| Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.000000 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.680000 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.000000 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.000000 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.500000 | 0 | Canceled |
# Verify the entire data file was read correctly by displaying the last five rows.
data.tail(5)
| Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36270 | INN36271 | 3 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 4 | 85 | 2018 | 8 | 3 | Online | 0 | 0 | 0 | 167.800000 | 1 | Not_Canceled |
| 36271 | INN36272 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 228 | 2018 | 10 | 17 | Online | 0 | 0 | 0 | 90.950000 | 2 | Canceled |
| 36272 | INN36273 | 2 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 1 | 148 | 2018 | 7 | 1 | Online | 0 | 0 | 0 | 98.390000 | 2 | Not_Canceled |
| 36273 | INN36274 | 2 | 0 | 0 | 3 | Not Selected | 0 | Room_Type 1 | 63 | 2018 | 4 | 21 | Online | 0 | 0 | 0 | 94.500000 | 0 | Canceled |
| 36274 | INN36275 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 207 | 2018 | 12 | 30 | Offline | 0 | 0 | 0 | 161.670000 | 0 | Not_Canceled |
#Check the size of the data
print(f"There are {data.shape[0]} rows and {data.shape[1]} features.")
There are 36275 rows and 19 features.
#Check overall information on the features
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
#Show the statistical summary of the data
data.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Booking_ID | 36275 | 36275 | INN00001 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| no_of_adults | 36275.000000 | NaN | NaN | NaN | 1.844962 | 0.518715 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 4.000000 |
| no_of_children | 36275.000000 | NaN | NaN | NaN | 0.105279 | 0.402648 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 10.000000 |
| no_of_weekend_nights | 36275.000000 | NaN | NaN | NaN | 0.810724 | 0.870644 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 7.000000 |
| no_of_week_nights | 36275.000000 | NaN | NaN | NaN | 2.204300 | 1.410905 | 0.000000 | 1.000000 | 2.000000 | 3.000000 | 17.000000 |
| type_of_meal_plan | 36275 | 4 | Meal Plan 1 | 27835 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| required_car_parking_space | 36275.000000 | NaN | NaN | NaN | 0.030986 | 0.173281 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| room_type_reserved | 36275 | 7 | Room_Type 1 | 28130 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| lead_time | 36275.000000 | NaN | NaN | NaN | 85.232557 | 85.930817 | 0.000000 | 17.000000 | 57.000000 | 126.000000 | 443.000000 |
| arrival_year | 36275.000000 | NaN | NaN | NaN | 2017.820427 | 0.383836 | 2017.000000 | 2018.000000 | 2018.000000 | 2018.000000 | 2018.000000 |
| arrival_month | 36275.000000 | NaN | NaN | NaN | 7.423653 | 3.069894 | 1.000000 | 5.000000 | 8.000000 | 10.000000 | 12.000000 |
| arrival_date | 36275.000000 | NaN | NaN | NaN | 15.596995 | 8.740447 | 1.000000 | 8.000000 | 16.000000 | 23.000000 | 31.000000 |
| market_segment_type | 36275 | 5 | Online | 23214 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| repeated_guest | 36275.000000 | NaN | NaN | NaN | 0.025637 | 0.158053 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| no_of_previous_cancellations | 36275.000000 | NaN | NaN | NaN | 0.023349 | 0.368331 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 13.000000 |
| no_of_previous_bookings_not_canceled | 36275.000000 | NaN | NaN | NaN | 0.153411 | 1.754171 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 58.000000 |
| avg_price_per_room | 36275.000000 | NaN | NaN | NaN | 103.423539 | 35.089424 | 0.000000 | 80.300000 | 99.450000 | 120.000000 | 540.000000 |
| no_of_special_requests | 36275.000000 | NaN | NaN | NaN | 0.619655 | 0.786236 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 5.000000 |
| booking_status | 36275 | 2 | Not_Canceled | 24390 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
- type_of_meal_plan: candidate for dummy encoding
- room_type_reserved: candidate for dummy encoding
- market_segment_type: candidate for dummy encoding
- booking_status: this is the dependent variable (Y value)
- The average no_of_adults is 1.844, while the average no_of_children is only 0.105.
- The average lead_time is 85.23 days.
- The average avg_price_per_room is 103.42 euros, while the maximum is 540 euros.
- The most frequently reserved room type is Room_Type 1.

# Check for missing values.
data.isnull().sum()
Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64
data.nunique()
Booking_ID                              36275
no_of_adults                                5
no_of_children                              6
no_of_weekend_nights                        8
no_of_week_nights                          18
type_of_meal_plan                           4
required_car_parking_space                  2
room_type_reserved                          7
lead_time                                 352
arrival_year                                2
arrival_month                              12
arrival_date                               31
market_segment_type                         5
repeated_guest                              2
no_of_previous_cancellations                9
no_of_previous_bookings_not_canceled       59
avg_price_per_room                       3930
no_of_special_requests                      6
booking_status                              2
dtype: int64
- arrival_year has only 2 unique values; this data set covers only two years of bookings.
- no_of_previous_bookings_not_canceled reaching 59 indicates repeat customers and the potential for a future loyalty program (if one is not already established).
- booking_status has two unique values, indicating a booking was either kept or cancelled.

# Check for duplicate values in the "Booking_ID" column
duplicate_booking_ids = data[data.duplicated(subset=['Booking_ID'], keep=False)]
# If there are duplicate booking IDs, they will need to be removed.
if duplicate_booking_ids.empty:
print("No duplicate Booking_IDs found.")
else:
print("Duplicate Booking_IDs found:")
print(duplicate_booking_ids)
No duplicate Booking_IDs found.
data.drop(columns=["Booking_ID"],axis=1,inplace=True)
data["booking_status"] = data["booking_status"].apply(
lambda x: 1 if x == "Canceled" else 0
)
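The lambda encoding above can be sanity-checked on a hypothetical toy Series (values below are illustrative): "Canceled" maps to 1 and every other status maps to 0.

```python
import pandas as pd

# toy status values mirroring the booking_status categories
s = pd.Series(["Canceled", "Not_Canceled", "Canceled"])
encoded = s.apply(lambda x: 1 if x == "Canceled" else 0)
print(encoded.tolist())  # [1, 0, 1]
```

Encoding the positive class ("Canceled") as 1 means recall later measures the share of actual cancellations the model catches, which is the metric the business cares about most.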
# Ensure consistent values for object features
# Loop through each column
for column in data.columns:
if data[column].dtype == 'object': # Check if column dtype is object (categorical)
unique_values = data[column].unique()
print(f"Unique values for column '{column}':")
for value in unique_values:
print("\t * ",value)
Unique values for column 'type_of_meal_plan':
	 *  Meal Plan 1
	 *  Not Selected
	 *  Meal Plan 2
	 *  Meal Plan 3
Unique values for column 'room_type_reserved':
	 *  Room_Type 1
	 *  Room_Type 4
	 *  Room_Type 2
	 *  Room_Type 6
	 *  Room_Type 5
	 *  Room_Type 7
	 *  Room_Type 3
Unique values for column 'market_segment_type':
	 *  Offline
	 *  Online
	 *  Corporate
	 *  Aviation
	 *  Complementary
#Verify the column was dropped successfully
print(f"There are {data.shape[0]} rows and {data.shape[1]} features.")
data.head(5)
There are 36275 rows and 18 features.
| no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.000000 | 0 | 0 |
| 1 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.680000 | 1 | 0 |
| 2 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.000000 | 0 | 1 |
| 3 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.000000 | 0 | 1 |
| 4 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.500000 | 0 | 1 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          36275 non-null  int64
 1   no_of_children                        36275 non-null  int64
 2   no_of_weekend_nights                  36275 non-null  int64
 3   no_of_week_nights                     36275 non-null  int64
 4   type_of_meal_plan                     36275 non-null  object
 5   required_car_parking_space            36275 non-null  int64
 6   room_type_reserved                    36275 non-null  object
 7   lead_time                             36275 non-null  int64
 8   arrival_year                          36275 non-null  int64
 9   arrival_month                         36275 non-null  int64
 10  arrival_date                          36275 non-null  int64
 11  market_segment_type                   36275 non-null  object
 12  repeated_guest                        36275 non-null  int64
 13  no_of_previous_cancellations          36275 non-null  int64
 14  no_of_previous_bookings_not_canceled  36275 non-null  int64
 15  avg_price_per_room                    36275 non-null  float64
 16  no_of_special_requests                36275 non-null  int64
 17  booking_status                        36275 non-null  int64
dtypes: float64(1), int64(14), object(3)
memory usage: 5.0+ MB
# Show the statistical summary
data.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.000000 | NaN | NaN | NaN | 1.844962 | 0.518715 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 4.000000 |
| no_of_children | 36275.000000 | NaN | NaN | NaN | 0.105279 | 0.402648 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 10.000000 |
| no_of_weekend_nights | 36275.000000 | NaN | NaN | NaN | 0.810724 | 0.870644 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 7.000000 |
| no_of_week_nights | 36275.000000 | NaN | NaN | NaN | 2.204300 | 1.410905 | 0.000000 | 1.000000 | 2.000000 | 3.000000 | 17.000000 |
| type_of_meal_plan | 36275 | 4 | Meal Plan 1 | 27835 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| required_car_parking_space | 36275.000000 | NaN | NaN | NaN | 0.030986 | 0.173281 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| room_type_reserved | 36275 | 7 | Room_Type 1 | 28130 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| lead_time | 36275.000000 | NaN | NaN | NaN | 85.232557 | 85.930817 | 0.000000 | 17.000000 | 57.000000 | 126.000000 | 443.000000 |
| arrival_year | 36275.000000 | NaN | NaN | NaN | 2017.820427 | 0.383836 | 2017.000000 | 2018.000000 | 2018.000000 | 2018.000000 | 2018.000000 |
| arrival_month | 36275.000000 | NaN | NaN | NaN | 7.423653 | 3.069894 | 1.000000 | 5.000000 | 8.000000 | 10.000000 | 12.000000 |
| arrival_date | 36275.000000 | NaN | NaN | NaN | 15.596995 | 8.740447 | 1.000000 | 8.000000 | 16.000000 | 23.000000 | 31.000000 |
| market_segment_type | 36275 | 5 | Online | 23214 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| repeated_guest | 36275.000000 | NaN | NaN | NaN | 0.025637 | 0.158053 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| no_of_previous_cancellations | 36275.000000 | NaN | NaN | NaN | 0.023349 | 0.368331 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 13.000000 |
| no_of_previous_bookings_not_canceled | 36275.000000 | NaN | NaN | NaN | 0.153411 | 1.754171 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 58.000000 |
| avg_price_per_room | 36275.000000 | NaN | NaN | NaN | 103.423539 | 35.089424 | 0.000000 | 80.300000 | 99.450000 | 120.000000 | 540.000000 |
| no_of_special_requests | 36275.000000 | NaN | NaN | NaN | 0.619655 | 0.786236 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 5.000000 |
| booking_status | 36275.000000 | NaN | NaN | NaN | 0.327636 | 0.469358 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
#Show listing of all the columns
data.columns
Index(['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
'no_of_week_nights', 'type_of_meal_plan', 'required_car_parking_space',
'room_type_reserved', 'lead_time', 'arrival_year', 'arrival_month',
'arrival_date', 'market_segment_type', 'repeated_guest',
'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
'avg_price_per_room', 'no_of_special_requests', 'booking_status'],
dtype='object')
no_of_adults

labeled_barplot(data, feature="no_of_adults", perc=True)

no_of_children

labeled_barplot(data, feature="no_of_children", perc=True)

no_of_weekend_nights

labeled_barplot(data, feature="no_of_weekend_nights", perc=True)

no_of_week_nights

labeled_barplot(data, feature="no_of_week_nights", perc=True)

type_of_meal_plan

labeled_barplot(data, feature="type_of_meal_plan", perc=True)
Most bookings opt for Meal Plan 1.

required_car_parking_space

labeled_barplot(data, feature="required_car_parking_space", perc=True)

room_type_reserved

labeled_barplot(data, feature="room_type_reserved", perc=True)

Most often Room_Type 1 is selected; the second most often is Room_Type 4.

arrival_year

labeled_barplot(data, feature="arrival_year", perc=True)
#Let's investigate 2017 a bit further
data[data['arrival_year']==2017]["arrival_month"].value_counts()
10    1913
9     1649
8     1014
12     928
11     647
7      363
Name: arrival_month, dtype: int64
#Let's investigate 2018 a bit further
data[data['arrival_year']==2018]["arrival_month"].value_counts()
10    3404
6     3203
9     2962
8     2799
4     2736
5     2598
7     2557
3     2358
11    2333
12    2093
2     1704
1     1014
Name: arrival_month, dtype: int64
arrival_date

labeled_barplot(data, feature="arrival_date", perc=True)

market_segment_type

labeled_barplot(data, feature="market_segment_type", perc=True)

repeated_guest

labeled_barplot(data, feature="repeated_guest", perc=True)

no_of_special_requests

labeled_barplot(data, feature="no_of_special_requests", perc=True)

booking_status

labeled_barplot(data, feature="booking_status", perc=True)

lead_time

histogram_boxplot(data, feature="lead_time")
lead_time has quite a few outliers beyond the upper whisker.

no_of_previous_cancellations

histogram_boxplot(data, feature="no_of_previous_cancellations")

no_of_previous_bookings_not_canceled

histogram_boxplot(data, feature="no_of_previous_bookings_not_canceled")

avg_price_per_room

histogram_boxplot(data, feature="avg_price_per_room", kde=True)
# Display the numeric fields in a heatmap to determine if there are any correlations between features
cols_list = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(12, 7))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
no_of_previous_bookings_not_canceled and repeated_guest are strongly correlated.

avg_price_per_room may impact booking_status

distribution_plot_wrt_target(data, "avg_price_per_room", "booking_status")

avg_price_per_room is slightly higher for bookings that were cancelled than for those that were not.

lead_time may impact booking_status

distribution_plot_wrt_target(data, "lead_time", "booking_status")

Cancelled bookings tend to have a lead_time of approximately 125 days, while retained bookings tend to have a lead_time of less than 50 days.

type_of_meal_plan impacts booking_status

stacked_barplot(data, "type_of_meal_plan", "booking_status")
booking_status         0      1    All
type_of_meal_plan
All                24390  11885  36275
Meal Plan 1        19156   8679  27835
Not Selected        3431   1699   5130
Meal Plan 2         1799   1506   3305
Meal Plan 3            4      1      5
------------------------------------------------------------------------------------------------------------------------
- Meal Plan 2 has the highest percentage of cancellations.
- Meal Plan 3 has the lowest percentage of cancellations; however, the number of bookings for this meal plan is very low and insignificant.
- The hotel could promote Meal Plan 1 to help decrease cancellations, or improve Meal Plan 2 so that it more closely mimics the success of Meal Plan 1.

The total number of guests may impact booking_status.

# Create a new field for total guests by adding the number of adults and the number of children traveling.
total_guests_data = data.copy()
# Add up the total number of guests traveling for each booking
total_guests_data["no_of_guests"] = (
total_guests_data["no_of_adults"] + total_guests_data["no_of_children"]
)
# Display the stacked barplot for no_of_guests vs booking_status
stacked_barplot(total_guests_data, "no_of_guests", "booking_status")
| no_of_guests | booking_status = 0 | booking_status = 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| 2 | 15662 | 8280 | 23942 |
| 1 | 5743 | 1809 | 7552 |
| 3 | 2459 | 1392 | 3851 |
| 4 | 514 | 398 | 912 |
| 5 | 10 | 5 | 15 |
| 11 | 0 | 1 | 1 |
| 10 | 1 | 0 | 1 |
| 12 | 1 | 0 | 1 |
# Create a new field for the total number of nights by adding the number of week nights to the number of weekend nights.
#Make a temporary copy of the data
total_nights_data = data.copy()
# Add up the total number of nights for each booking
total_nights_data["total_nights"] = (
total_nights_data["no_of_week_nights"] + total_nights_data["no_of_weekend_nights"]
)
# View the total counts for each total night value to help reduce the insignificant data values.
total_nights_data["total_nights"].value_counts()
| total_nights | count |
|---|---|
| 3 | 10052 |
| 2 | 8472 |
| 1 | 6604 |
| 4 | 5893 |
| 5 | 2589 |
| 6 | 1031 |
| 7 | 973 |
| 8 | 179 |
| 9 | 111 |
| 10 | 109 |
| 0 | 78 |
| 11 | 39 |
| 14 | 32 |
| 15 | 31 |
| 12 | 24 |
| 13 | 18 |
| 20 | 11 |
| 19 | 6 |
| 16 | 6 |
| 17 | 5 |
| 21 | 4 |
| 18 | 3 |
| 23 | 2 |
| 22 | 2 |
| 24 | 1 |

Name: total_nights, dtype: int64
# Let's remove some of the data containing insignificant counts for easier analysis on the stacked barplot.
total_nights_data = total_nights_data[total_nights_data["total_nights"] <= 15]
# Review the info again
total_nights_data.info()
&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 36235 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          36235 non-null  int64
 1   no_of_children                        36235 non-null  int64
 2   no_of_weekend_nights                  36235 non-null  int64
 3   no_of_week_nights                     36235 non-null  int64
 4   type_of_meal_plan                     36235 non-null  object
 5   required_car_parking_space            36235 non-null  int64
 6   room_type_reserved                    36235 non-null  object
 7   lead_time                             36235 non-null  int64
 8   arrival_year                          36235 non-null  int64
 9   arrival_month                         36235 non-null  int64
 10  arrival_date                          36235 non-null  int64
 11  market_segment_type                   36235 non-null  object
 12  repeated_guest                        36235 non-null  int64
 13  no_of_previous_cancellations          36235 non-null  int64
 14  no_of_previous_bookings_not_canceled  36235 non-null  int64
 15  avg_price_per_room                    36235 non-null  float64
 16  no_of_special_requests                36235 non-null  int64
 17  booking_status                        36235 non-null  int64
 18  total_nights                          36235 non-null  int64
dtypes: float64(1), int64(15), object(3)
memory usage: 5.5+ MB
# Display the stacked barplot for total_nights vs booking_status
stacked_barplot(total_nights_data, "total_nights", "booking_status")
| total_nights | booking_status = 0 | booking_status = 1 | All |
|---|---|---|---|
| All | 24382 | 11853 | 36235 |
| 3 | 6466 | 3586 | 10052 |
| 2 | 5573 | 2899 | 8472 |
| 4 | 3952 | 1941 | 5893 |
| 1 | 5138 | 1466 | 6604 |
| 5 | 1766 | 823 | 2589 |
| 6 | 566 | 465 | 1031 |
| 7 | 590 | 383 | 973 |
| 8 | 100 | 79 | 179 |
| 10 | 51 | 58 | 109 |
| 9 | 58 | 53 | 111 |
| 14 | 5 | 27 | 32 |
| 15 | 5 | 26 | 31 |
| 11 | 24 | 15 | 39 |
| 12 | 9 | 15 | 24 |
| 13 | 3 | 15 | 18 |
| 0 | 76 | 2 | 78 |
room_type_reserved has an impact on booking_status
# Display the stacked barplot for room_type_reserved vs booking_status
stacked_barplot(total_guests_data, "room_type_reserved", "booking_status")
| room_type_reserved | booking_status = 0 | booking_status = 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| Room_Type 1 | 19058 | 9072 | 28130 |
| Room_Type 4 | 3988 | 2069 | 6057 |
| Room_Type 6 | 560 | 406 | 966 |
| Room_Type 2 | 464 | 228 | 692 |
| Room_Type 5 | 193 | 72 | 265 |
| Room_Type 7 | 122 | 36 | 158 |
| Room_Type 3 | 5 | 2 | 7 |
required_car_parking_space impacts booking_status
# Display the stacked barplot for required_car_parking_space vs booking_status
stacked_barplot(total_guests_data, "required_car_parking_space", "booking_status")
| required_car_parking_space | booking_status = 0 | booking_status = 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| 0 | 23380 | 11771 | 35151 |
| 1 | 1010 | 114 | 1124 |
Leading Questions:
# grouping the data on arrival months and extracting the count of bookings
monthly_data = data.groupby(["arrival_month"])["booking_status"].count()
print(monthly_data)
print(monthly_data.values)
# creating a dataframe with months and count of customers in each month
monthly_data = pd.DataFrame(
{"Month": list(monthly_data.index), "Number of Bookings": list(monthly_data.values)}
)
# plotting the trend over different months
plt.figure(figsize=(10, 5))
sns.lineplot(data=monthly_data, x="Month", y="Number of Bookings")
plt.show()
| arrival_month | bookings |
|---|---|
| 1 | 1014 |
| 2 | 1704 |
| 3 | 2358 |
| 4 | 2736 |
| 5 | 2598 |
| 6 | 3203 |
| 7 | 2920 |
| 8 | 3813 |
| 9 | 4611 |
| 10 | 5317 |
| 11 | 2980 |
| 12 | 3021 |

Name: booking_status, dtype: int64
[1014 1704 2358 2736 2598 3203 2920 3813 4611 5317 2980 3021]
# Let's determine the market segment that most of the guests come from
labeled_barplot(data, feature="market_segment_type", perc=True)
# Let's also plot market_segment_type vs booking_status
stacked_barplot(data, "market_segment_type", "booking_status")
| market_segment_type | booking_status = 0 | booking_status = 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| Online | 14739 | 8475 | 23214 |
| Offline | 7375 | 3153 | 10528 |
| Corporate | 1797 | 220 | 2017 |
| Aviation | 88 | 37 | 125 |
| Complementary | 391 | 0 | 391 |
# Display multi-boxplots by market_segment_type
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x="market_segment_type", y="avg_price_per_room")
plt.show()
# Grouping the data on market_segment_type and then taking the median of avg_price_per_room
market_segment_data = data.groupby(["market_segment_type"])["avg_price_per_room"].median()
market_segment_data
| market_segment_type | median avg_price_per_room |
|---|---|
| Aviation | 95.00 |
| Complementary | 0.00 |
| Corporate | 79.00 |
| Offline | 90.00 |
| Online | 107.10 |
The median avg_price_per_room value of 107 euros is the largest for the Online market segment.
The lowest median avg_price_per_room value is for the Complementary market segment, which makes sense.
# Determine the percentage of bookings that are cancelled
labeled_barplot(data, feature="booking_status", perc=True)
# Create a stacked barplot of arrival_months vs booking_status
stacked_barplot(data,"arrival_month","booking_status")
| arrival_month | booking_status = 0 | booking_status = 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| 10 | 3437 | 1880 | 5317 |
| 9 | 3073 | 1538 | 4611 |
| 8 | 2325 | 1488 | 3813 |
| 7 | 1606 | 1314 | 2920 |
| 6 | 1912 | 1291 | 3203 |
| 4 | 1741 | 995 | 2736 |
| 5 | 1650 | 948 | 2598 |
| 11 | 2105 | 875 | 2980 |
| 3 | 1658 | 700 | 2358 |
| 2 | 1274 | 430 | 1704 |
| 12 | 2619 | 402 | 3021 |
| 1 | 990 | 24 | 1014 |
# Create a stacked barplot
stacked_barplot(data,"repeated_guest","booking_status")
# Calculate percentage of cancellations for each repeated_guest value
cancellation_percentage = data.groupby(['repeated_guest', 'booking_status']).size().unstack(fill_value=0)
cancellation_percentage = cancellation_percentage.apply(lambda x: x / x.sum(), axis=1) * 100
print (cancellation_percentage)
| repeated_guest | booking_status = 0 | booking_status = 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| 0 | 23476 | 11869 | 35345 |
| 1 | 914 | 16 | 930 |

| repeated_guest | booking_status = 0 (%) | booking_status = 1 (%) |
|---|---|---|
| 0 | 66.419578 | 33.580422 |
| 1 | 98.279570 | 1.720430 |
stacked_barplot(data, "no_of_special_requests", "booking_status")
| no_of_special_requests | booking_status = 0 | booking_status = 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| 0 | 11232 | 8545 | 19777 |
| 1 | 8670 | 2703 | 11373 |
| 2 | 3727 | 637 | 4364 |
| 3 | 675 | 0 | 675 |
| 4 | 78 | 0 | 78 |
| 5 | 8 | 0 | 8 |
outlier_detection(data)
# specifying the independent and dependent variables
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]
# adding a constant to the independent variables
X = sm.add_constant(X)
# creating dummy variables
X = pd.get_dummies(X, drop_first=True)
# splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 28)
Shape of test set :  (10883, 28)
Percentage of classes in training set:
0    0.670644
1    0.329356
Name: booking_status, dtype: float64
Percentage of classes in test set:
0    0.676376
1    0.323624
Name: booking_status, dtype: float64
We will now perform logistic regression using statsmodels, a Python module that provides functions for the estimation of many statistical models, as well as for conducting statistical tests, and statistical data exploration.
Using statsmodels, we will be able to check the statistical validity of our model - identify the significant predictors from p-values that we get for each predictor variable.
# fitting logistic regression model
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(disp=False)
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25364
Method: MLE Df Model: 27
Date: Fri, 19 Apr 2024 Pseudo R-squ.: 0.3293
Time: 17:09:55 Log-Likelihood: -10793.
converged: False LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -924.5923 120.817 -7.653 0.000 -1161.390 -687.795
no_of_adults 0.1135 0.038 3.017 0.003 0.040 0.187
no_of_children 0.1563 0.057 2.732 0.006 0.044 0.268
no_of_weekend_nights 0.1068 0.020 5.398 0.000 0.068 0.146
no_of_week_nights 0.0398 0.012 3.239 0.001 0.016 0.064
required_car_parking_space -1.5939 0.138 -11.561 0.000 -1.864 -1.324
lead_time 0.0157 0.000 58.868 0.000 0.015 0.016
arrival_year 0.4570 0.060 7.633 0.000 0.340 0.574
arrival_month -0.0415 0.006 -6.418 0.000 -0.054 -0.029
arrival_date 0.0005 0.002 0.252 0.801 -0.003 0.004
repeated_guest -2.3469 0.617 -3.805 0.000 -3.556 -1.138
no_of_previous_cancellations 0.2664 0.086 3.108 0.002 0.098 0.434
no_of_previous_bookings_not_canceled -0.1727 0.153 -1.131 0.258 -0.472 0.127
avg_price_per_room 0.0188 0.001 25.404 0.000 0.017 0.020
no_of_special_requests -1.4690 0.030 -48.790 0.000 -1.528 -1.410
type_of_meal_plan_Meal Plan 2 0.1768 0.067 2.654 0.008 0.046 0.307
type_of_meal_plan_Meal Plan 3 17.8379 5057.771 0.004 0.997 -9895.211 9930.887
type_of_meal_plan_Not Selected 0.2782 0.053 5.245 0.000 0.174 0.382
room_type_reserved_Room_Type 2 -0.3610 0.131 -2.761 0.006 -0.617 -0.105
room_type_reserved_Room_Type 3 -0.0009 1.310 -0.001 0.999 -2.569 2.567
room_type_reserved_Room_Type 4 -0.2821 0.053 -5.305 0.000 -0.386 -0.178
room_type_reserved_Room_Type 5 -0.7176 0.209 -3.432 0.001 -1.127 -0.308
room_type_reserved_Room_Type 6 -0.9456 0.147 -6.434 0.000 -1.234 -0.658
room_type_reserved_Room_Type 7 -1.3964 0.293 -4.767 0.000 -1.971 -0.822
market_segment_type_Complementary -41.8798 8.42e+05 -4.98e-05 1.000 -1.65e+06 1.65e+06
market_segment_type_Corporate -1.1935 0.266 -4.487 0.000 -1.715 -0.672
market_segment_type_Offline -2.1955 0.255 -8.625 0.000 -2.694 -1.697
market_segment_type_Online -0.3990 0.251 -1.588 0.112 -0.891 0.093
========================================================================================================
Negative values of the coefficient show that the probability of a guest cancelling a booking decreases with the increase of the corresponding attribute value.
Positive values of the coefficient show that the probability of a guest cancelling a booking increases with the increase of the corresponding attribute value.
The p-value of a variable indicates whether the variable is significant. If we consider the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 is considered significant.
Model can make wrong predictions as:
Predicting a guest will cancel but in reality they do not cancel.
Predicting a guest will not cancel but in reality they do cancel.
Which case is more important?
Both the cases are important:
False Positive: If we predict a guest will cancel but they actually do not, the hotel may overbook its rooms, resulting in unsatisfied customers.
False Negative: If we predict a guest will not cancel but they actually do, the hotel loses revenue by not booking its rooms to capacity.
Therefore, both of these scenarios (Type I and Type II errors) are important, and we want to minimize both.
How to reduce this loss?
We need to reduce both False Negatives (improving Recall) and False Positives (improving Precision).
The f1_score should be maximized: the greater the f1_score, the better the model is at reducing both False Negatives and False Positives, and at identifying both classes correctly.
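To make the tradeoff concrete, here is a minimal sketch (with made-up confusion-matrix counts, not values from this model) of how precision, recall, and F1 relate:

```python
# Illustrative confusion-matrix counts (made up, not from this model):
# tp = cancellations we caught, fp = guests wrongly flagged as cancelling,
# fn = cancellations we missed.
def f1_from_counts(tp, fp, fn):
    precision = tp / (tp + fp)  # hurt by false positives (overbooking risk)
    recall = tp / (tp + fn)     # hurt by false negatives (lost revenue)
    return 2 * precision * recall / (precision + recall)

print(round(f1_from_counts(tp=7500, fp=2600, fn=4400), 3))  # -> 0.682
```

Because F1 is the harmonic mean of precision and recall, it only gets large when both error types are kept low at the same time.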
# Convert to float, as leaving the original dtypes causes problems when creating the confusion matrix
X_train = X_train.astype(float)
# Display the confusion matrix
confusion_matrix_statsmodels(lg, X_train, y_train)
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train, y_train)
Training performance:
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.806041 | 0.634222 | 0.739749 | 0.682933 |
The f1_score of the model is ~0.68, and we will try to improve it further
The variables used to build the model might contain multicollinearity, which will affect the p-values
There are different ways of detecting (or testing for) multicollinearity. One such way is the Variance Inflation Factor (VIF).
Variance Inflation Factor: Variance inflation factors measure the inflation in the variances of the regression coefficient estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient $\beta_k$ is "inflated" by the existence of correlation among the predictor variables in the model.
General rule of thumb:
A VIF between 1 and 5 indicates moderate multicollinearity, while a VIF above 5 (some analysts use 10) indicates high multicollinearity that may warrant attention.
The purpose of the analysis should dictate which threshold to use.
vif_series = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection:

const                                   39468156.706004
no_of_adults                            1.348154
no_of_children                          1.978229
no_of_weekend_nights                    1.069475
no_of_week_nights                       1.095667
required_car_parking_space              1.039928
lead_time                               1.394914
arrival_year                            1.430830
arrival_month                           1.275673
arrival_date                            1.006738
repeated_guest                          1.783516
no_of_previous_cancellations            1.395689
no_of_previous_bookings_not_canceled    1.651986
avg_price_per_room                      2.050421
no_of_special_requests                  1.247278
type_of_meal_plan_Meal Plan 2           1.271851
type_of_meal_plan_Meal Plan 3           1.025216
type_of_meal_plan_Not Selected          1.272183
room_type_reserved_Room_Type 2          1.101438
room_type_reserved_Room_Type 3          1.003302
room_type_reserved_Room_Type 4          1.361515
room_type_reserved_Room_Type 5          1.027810
room_type_reserved_Room_Type 6          1.973072
room_type_reserved_Room_Type 7          1.115123
market_segment_type_Complementary       4.500109
market_segment_type_Corporate           16.928435
market_segment_type_Offline             64.113924
market_segment_type_Online              71.176430
dtype: float64
The market_segment_type dummy categories have very high VIF values.
We will drop the market_segment_type_Online dummy category field and re-assess the VIF values.
# Let's drop the market_segment_type_Online column from both the X_train and X_test data frames
col_to_drop = "market_segment_type_Online"
X_train1 = X_train.loc[:, ~X_train.columns.str.startswith(col_to_drop)]
X_test1 = X_test.loc[:, ~X_test.columns.str.startswith(col_to_drop)]
# Reassess the VIF values
vif_series = pd.Series(
[variance_inflation_factor(X_train1.values, i) for i in range(X_train1.shape[1])],
index=X_train1.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection:

const                                   39391371.314593
no_of_adults                            1.331784
no_of_children                          1.977350
no_of_weekend_nights                    1.069039
no_of_week_nights                       1.095118
required_car_parking_space              1.039795
lead_time                               1.390637
arrival_year                            1.428376
arrival_month                           1.274625
arrival_date                            1.006721
repeated_guest                          1.780188
no_of_previous_cancellations            1.395447
no_of_previous_bookings_not_canceled    1.651745
avg_price_per_room                      2.049595
no_of_special_requests                  1.242418
type_of_meal_plan_Meal Plan 2           1.271497
type_of_meal_plan_Meal Plan 3           1.025216
type_of_meal_plan_Not Selected          1.270387
room_type_reserved_Room_Type 2          1.101271
room_type_reserved_Room_Type 3          1.003301
room_type_reserved_Room_Type 4          1.356004
room_type_reserved_Room_Type 5          1.027810
room_type_reserved_Room_Type 6          1.972732
room_type_reserved_Room_Type 7          1.115003
market_segment_type_Complementary       1.338253
market_segment_type_Corporate           1.527769
market_segment_type_Offline             1.597418
dtype: float64
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(disp=False)
print("Training performance:")
model_performance_classification_statsmodels(lg1, X_train1, y_train)
Training performance:
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.805766 | 0.633744 | 0.739294 | 0.682462 |
# Let's review the summary for additional p-value analysis
print(lg1.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25365
Method: MLE Df Model: 26
Date: Fri, 19 Apr 2024 Pseudo R-squ.: 0.3292
Time: 17:10:06 Log-Likelihood: -10794.
converged: False LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -933.3324 120.655 -7.736 0.000 -1169.813 -696.852
no_of_adults 0.1060 0.037 2.841 0.004 0.033 0.179
no_of_children 0.1542 0.057 2.694 0.007 0.042 0.266
no_of_weekend_nights 0.1075 0.020 5.439 0.000 0.069 0.146
no_of_week_nights 0.0405 0.012 3.295 0.001 0.016 0.065
required_car_parking_space -1.5907 0.138 -11.538 0.000 -1.861 -1.320
lead_time 0.0157 0.000 58.933 0.000 0.015 0.016
arrival_year 0.4611 0.060 7.711 0.000 0.344 0.578
arrival_month -0.0411 0.006 -6.358 0.000 -0.054 -0.028
arrival_date 0.0005 0.002 0.257 0.797 -0.003 0.004
repeated_guest -2.3140 0.618 -3.743 0.000 -3.526 -1.102
no_of_previous_cancellations 0.2633 0.086 3.074 0.002 0.095 0.431
no_of_previous_bookings_not_canceled -0.1728 0.152 -1.136 0.256 -0.471 0.125
avg_price_per_room 0.0187 0.001 25.374 0.000 0.017 0.020
no_of_special_requests -1.4709 0.030 -48.891 0.000 -1.530 -1.412
type_of_meal_plan_Meal Plan 2 0.1794 0.067 2.694 0.007 0.049 0.310
type_of_meal_plan_Meal Plan 3 19.8256 1.36e+04 0.001 0.999 -2.67e+04 2.67e+04
type_of_meal_plan_Not Selected 0.2745 0.053 5.181 0.000 0.171 0.378
room_type_reserved_Room_Type 2 -0.3640 0.131 -2.784 0.005 -0.620 -0.108
room_type_reserved_Room_Type 3 -0.0018 1.310 -0.001 0.999 -2.569 2.566
room_type_reserved_Room_Type 4 -0.2763 0.053 -5.207 0.000 -0.380 -0.172
room_type_reserved_Room_Type 5 -0.7182 0.209 -3.436 0.001 -1.128 -0.308
room_type_reserved_Room_Type 6 -0.9408 0.147 -6.402 0.000 -1.229 -0.653
room_type_reserved_Room_Type 7 -1.3891 0.293 -4.743 0.000 -1.963 -0.815
market_segment_type_Complementary -47.7454 7.09e+06 -6.74e-06 1.000 -1.39e+07 1.39e+07
market_segment_type_Corporate -0.8033 0.103 -7.807 0.000 -1.005 -0.602
market_segment_type_Offline -1.7995 0.052 -34.577 0.000 -1.902 -1.698
========================================================================================================
The following variables have p-values greater than 0.05, so they are not statistically significant:
arrival_date
no_of_previous_bookings_not_canceled
type_of_meal_plan_Meal Plan 3
room_type_reserved_Room_Type 3
market_segment_type_Complementary
Note: The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious, and using a loop will be more efficient.
# initial list of columns
cols = X_train1.columns.tolist()

# setting an initial max p-value
max_p_value = 1

while len(cols) > 0:
    # defining the train set
    X_train_aux = X_train1[cols]

    # fitting the model
    model = sm.Logit(y_train, X_train_aux).fit(disp=False)

    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    # drop the least significant variable and refit; stop once all p-values <= 0.05
    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'repeated_guest', 'no_of_previous_cancellations', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Corporate', 'market_segment_type_Offline']
# Let's create new X_train and X_test sets using only the selected features (they should all have p-values < .05)
X_train2 = X_train1[selected_features]
X_test2 = X_test1[selected_features]
# Review the training set feature set
X_train2.info()
&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 25392 entries, 13662 to 33003
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   const                           25392 non-null  float64
 1   no_of_adults                    25392 non-null  float64
 2   no_of_children                  25392 non-null  float64
 3   no_of_weekend_nights            25392 non-null  float64
 4   no_of_week_nights               25392 non-null  float64
 5   required_car_parking_space      25392 non-null  float64
 6   lead_time                       25392 non-null  float64
 7   arrival_year                    25392 non-null  float64
 8   arrival_month                   25392 non-null  float64
 9   repeated_guest                  25392 non-null  float64
 10  no_of_previous_cancellations    25392 non-null  float64
 11  avg_price_per_room              25392 non-null  float64
 12  no_of_special_requests          25392 non-null  float64
 13  type_of_meal_plan_Meal Plan 2   25392 non-null  float64
 14  type_of_meal_plan_Not Selected  25392 non-null  float64
 15  room_type_reserved_Room_Type 2  25392 non-null  float64
 16  room_type_reserved_Room_Type 4  25392 non-null  float64
 17  room_type_reserved_Room_Type 5  25392 non-null  float64
 18  room_type_reserved_Room_Type 6  25392 non-null  float64
 19  room_type_reserved_Room_Type 7  25392 non-null  float64
 20  market_segment_type_Corporate   25392 non-null  float64
 21  market_segment_type_Offline     25392 non-null  float64
dtypes: float64(22)
memory usage: 4.5 MB
logit2 = sm.Logit(y_train, X_train2.astype(float))
lg2 = logit2.fit(disp=False)
print(lg2.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25370
Method: MLE Df Model: 21
Date: Fri, 19 Apr 2024 Pseudo R-squ.: 0.3283
Time: 17:10:12 Log-Likelihood: -10809.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const -917.2860 120.456 -7.615 0.000 -1153.376 -681.196
no_of_adults 0.1086 0.037 2.914 0.004 0.036 0.182
no_of_children 0.1522 0.057 2.660 0.008 0.040 0.264
no_of_weekend_nights 0.1086 0.020 5.501 0.000 0.070 0.147
no_of_week_nights 0.0418 0.012 3.403 0.001 0.018 0.066
required_car_parking_space -1.5943 0.138 -11.561 0.000 -1.865 -1.324
lead_time 0.0157 0.000 59.218 0.000 0.015 0.016
arrival_year 0.4531 0.060 7.591 0.000 0.336 0.570
arrival_month -0.0424 0.006 -6.568 0.000 -0.055 -0.030
repeated_guest -2.7365 0.557 -4.915 0.000 -3.828 -1.645
no_of_previous_cancellations 0.2289 0.077 2.983 0.003 0.078 0.379
avg_price_per_room 0.0192 0.001 26.343 0.000 0.018 0.021
no_of_special_requests -1.4699 0.030 -48.892 0.000 -1.529 -1.411
type_of_meal_plan_Meal Plan 2 0.1654 0.067 2.487 0.013 0.035 0.296
type_of_meal_plan_Not Selected 0.2858 0.053 5.405 0.000 0.182 0.389
room_type_reserved_Room_Type 2 -0.3560 0.131 -2.725 0.006 -0.612 -0.100
room_type_reserved_Room_Type 4 -0.2826 0.053 -5.330 0.000 -0.387 -0.179
room_type_reserved_Room_Type 5 -0.7352 0.208 -3.529 0.000 -1.143 -0.327
room_type_reserved_Room_Type 6 -0.9650 0.147 -6.572 0.000 -1.253 -0.677
room_type_reserved_Room_Type 7 -1.4312 0.293 -4.892 0.000 -2.005 -0.858
market_segment_type_Corporate -0.7928 0.103 -7.711 0.000 -0.994 -0.591
market_segment_type_Offline -1.7867 0.052 -34.391 0.000 -1.889 -1.685
==================================================================================================
print("Training performance:")
model_performance_classification_statsmodels(lg2, X_train2, y_train)
Training performance:
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.805411 | 0.632548 | 0.739033 | 0.681657 |
# converting coefficients to odds
odds = np.exp(lg2.params)
# finding the percentage change
perc_change_odds = (np.exp(lg2.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train2.columns).T
| const | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | repeated_guest | no_of_previous_cancellations | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Corporate | market_segment_type_Offline | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 0.000000 | 1.114754 | 1.164360 | 1.114753 | 1.042636 | 0.203048 | 1.015835 | 1.573235 | 0.958528 | 0.064797 | 1.257157 | 1.019348 | 0.229941 | 1.179916 | 1.330892 | 0.700461 | 0.753830 | 0.479403 | 0.380991 | 0.239033 | 0.452584 | 0.167504 |
| Change_odd% | -100.000000 | 11.475363 | 16.436009 | 11.475256 | 4.263629 | -79.695231 | 1.583521 | 57.323511 | -4.147245 | -93.520258 | 25.715665 | 1.934790 | -77.005947 | 17.991562 | 33.089244 | -29.953888 | -24.617006 | -52.059666 | -61.900934 | -76.096691 | -54.741616 | -83.249628 |
Coefficient interpretations
no_of_adults: Holding all other features constant, a 1 unit increase in no_of_adults multiplies the odds of the guest cancelling by ~1.11, a ~11.48% increase in the odds of cancelling.
no_of_previous_cancellations: Holding all other features constant, a 1 unit increase in no_of_previous_cancellations multiplies the odds of the guest cancelling by ~1.26, a ~25.72% increase in the odds of cancelling.
no_of_special_requests: Holding all other features constant, a 1 unit increase in no_of_special_requests multiplies the odds of the guest cancelling by ~0.23, a ~77.01% decrease in the odds of cancelling.
required_car_parking_space: Holding all other features constant, requiring a car parking space multiplies the odds of the guest cancelling by ~0.20, a ~79.70% decrease in the odds of cancelling.
repeated_guest: Holding all other features constant, being a repeated guest multiplies the odds of cancelling by ~0.06, a ~93.52% decrease in the odds of cancelling.
Interpretations for the other attributes can be done similarly.
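The arithmetic behind these interpretations is just exponentiation of the fitted coefficient; for example, taking the no_of_special_requests coefficient from the lg2 summary above:

```python
import numpy as np

# The odds multiplier is exp(coef); the percentage change is (exp(coef) - 1) * 100.
# Coefficient below is no_of_special_requests from the lg2 summary above.
coef = -1.4699
odds_multiplier = np.exp(coef)  # odds are multiplied by this value per extra request
pct_change = (odds_multiplier - 1) * 100

print(round(odds_multiplier, 2), round(pct_change, 1))  # -> 0.23 -77.0
```

This reproduces the 0.2299 odds and ~77% decrease shown in the table above for no_of_special_requests.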
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_train2, y_train)
log_reg_model_train_perf = model_performance_classification_statsmodels(
lg2, X_train2, y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.805411 | 0.632548 | 0.739033 | 0.681657 |
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_test2, y_test)
log_reg_model_test_perf = model_performance_classification_statsmodels(
lg2, X_test2, y_test
)
print("Test performance:")
log_reg_model_test_perf
Test performance:
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.804649 | 0.630892 | 0.729003 | 0.676408 |
# Plot the False Positive Rate (FPR) vs True Positive Rate (TPR)
logit_roc_auc_train = roc_auc_score(y_train, lg2.predict(X_train2))
fpr, tpr, thresholds = roc_curve(y_train, lg2.predict(X_train2))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg2.predict(X_train2))
# Find the optimal threshold by maximizing the difference between TPR and FPR (Youden's J statistic)
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(f"The AUC-ROC optimal threshold is {optimal_threshold_auc_roc}")
The AUC-ROC optimal threshold is 0.3710466623490246
# creating confusion matrix
confusion_matrix_statsmodels(
lg2, X_train2, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg2, X_train2, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.792888 | 0.735621 | 0.668696 | 0.700564 |
# Plot the False Positive Rate (FPR) vs True Positive Rate (TPR)
logit_roc_auc_test = roc_auc_score(y_test, lg2.predict(X_test2))
fpr, tpr, thresholds = roc_curve(y_test, lg2.predict(X_test2))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_test2, y_test, threshold=optimal_threshold_auc_roc)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg2, X_test2, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.796012 | 0.739353 | 0.666667 | 0.701131 |
# Plot the Precision vs Recall intersecting line graph
y_scores = lg2.predict(X_train2)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
# Find the threshold where the precision and recall curves intersect
# (a more robust alternative would be np.argmin(np.abs(prec - rec)))
for i in range(len(prec)):
    if prec[i] == rec[i]:
        optimal_threshold_curve = tre[i]
print(f"optimal_threshold_curve is {optimal_threshold_curve} ")
optimal_threshold_curve is 0.4209574614254219
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_train2, y_train, threshold=optimal_threshold_curve)
# Calculate the model's performance metrics against the training data set
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg2, X_train2, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.801749 | 0.698912 | 0.699079 | 0.698995 |
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_test2, y_test, threshold=optimal_threshold_curve)
# Calculate the model's performance metrics against the test data set
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg2, X_test2, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.804098 | 0.703010 | 0.695115 | 0.699040 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold (0.5)",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic Regression-default Threshold (0.5) | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Accuracy | 0.805411 | 0.792888 | 0.801749 |
| Recall | 0.632548 | 0.735621 | 0.698912 |
| Precision | 0.739033 | 0.668696 | 0.699079 |
| F1 | 0.681657 | 0.700564 | 0.698995 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression-default Threshold (0.5)",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Logistic Regression-default Threshold (0.5) | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Accuracy | 0.804649 | 0.796012 | 0.804098 |
| Recall | 0.630892 | 0.739353 | 0.703010 |
| Precision | 0.729003 | 0.666667 | 0.695115 |
| F1 | 0.676408 | 0.701131 | 0.699040 |
We have built a predictive model that INN Hotels Group can use to anticipate which guests are likely to cancel their bookings, with an F1 score of ~0.70 on the training set, and to formulate new cancellation and refund policies.
All the logistic regression models have given a generalized performance on the training and test set.
The coefficients of no_of_adults, no_of_children, no_of_weekend_nights, no_of_week_nights, lead_time, arrival_year, no_of_previous_cancellations, avg_price_per_room, type_of_meal_plan_Meal Plan 2, and type_of_meal_plan_Not Selected are positive; an increase in any of these increases the chances of a guest canceling their booking.
The coefficients of required_car_parking_space, arrival_month, repeated_guest, no_of_special_requests, room_type_reserved_Room_Type 2, room_type_reserved_Room_Type 4, room_type_reserved_Room_Type 5, room_type_reserved_Room_Type 6, room_type_reserved_Room_Type 7, market_segment_type_Corporate, and market_segment_type_Offline are negative; an increase in any of these decreases the chances of a guest canceling their booking.
# Since we went through a detailed EDA on the original dataset, we will not repeat it here.
# Decision tree models are not affected by multicollinearity, so we can start from the original dataset.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          36275 non-null  int64
 1   no_of_children                        36275 non-null  int64
 2   no_of_weekend_nights                  36275 non-null  int64
 3   no_of_week_nights                     36275 non-null  int64
 4   type_of_meal_plan                     36275 non-null  object
 5   required_car_parking_space            36275 non-null  int64
 6   room_type_reserved                    36275 non-null  object
 7   lead_time                             36275 non-null  int64
 8   arrival_year                          36275 non-null  int64
 9   arrival_month                         36275 non-null  int64
 10  arrival_date                          36275 non-null  int64
 11  market_segment_type                   36275 non-null  object
 12  repeated_guest                        36275 non-null  int64
 13  no_of_previous_cancellations          36275 non-null  int64
 14  no_of_previous_bookings_not_canceled  36275 non-null  int64
 15  avg_price_per_room                    36275 non-null  float64
 16  no_of_special_requests                36275 non-null  int64
 17  booking_status                        36275 non-null  int64
dtypes: float64(1), int64(14), object(3)
memory usage: 5.0+ MB
# specifying the independent and dependent variables
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]
# adding a constant to the independent variables
X = sm.add_constant(X)
# creating dummy variables
X = pd.get_dummies(X, drop_first=True)
# splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 28)
Shape of test set :  (10883, 28)
Percentage of classes in training set:
0    0.670644
1    0.329356
Name: booking_status, dtype: float64
Percentage of classes in test set:
0    0.676376
1    0.323624
Name: booking_status, dtype: float64
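The class proportions differ slightly between the two splits (about 32.9% vs 32.4% cancellations) because the split is purely random. If exactly matched proportions are wanted, train_test_split accepts a stratify argument; a small sketch on toy labels (illustrative, not the split used in this notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 70% class 0, 30% class 1
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)

# Both splits now carry exactly 30% of class 1
print(y_tr.mean(), y_te.mean())
```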
# Create a Decision Tree Classifier with random_state=1 for reproducibility
model0 = DecisionTreeClassifier(random_state=1)
# Fit the model using the training data
model0.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
# Let's display the confusion matrix for the default decision tree model using training data.
confusion_matrix_sklearn(model0, X_train, y_train)
# Display the Decision Tree default model performance metrics for the Training Data
decision_tree_perf_train_without = model_performance_classification_sklearn(
model0, X_train, y_train
)
decision_tree_perf_train_without
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.994211 | 0.986608 | 0.995776 | 0.991171 |
# Display the Decision Tree default model performance metrics for the Test Data
decision_tree_perf_test_without = model_performance_classification_sklearn(
model0, X_test, y_test
)
decision_tree_perf_test_without
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.874299 | 0.814026 | 0.800838 | 0.807378 |
Yes. The tree fits the training data almost perfectly (F1 ≈ 0.99) but drops to F1 ≈ 0.81 on the test data, a clear sign of overfitting, so the Decision Tree should be pruned.
# Plot the important features using a bar chart
feature_names = list(X_train.columns)
importances = model0.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Type of scoring used to compare parameter combinations
# Use f1_score since it's important to equally try and reduce FP and FN.
acc_scorer = make_scorer(f1_score)
# Run the grid search
# The GridSearchCV runs through all combinations of the parameters that can then be used
# to select the best estimator.
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
# Create the confusion matrix using the training data.
confusion_matrix_sklearn(estimator, X_train, y_train)
# Create the new model's performance metrics against the training data.
decision_tree_tune_perf_train = model_performance_classification_sklearn(
estimator, X_train, y_train
)
decision_tree_tune_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.831010 | 0.786201 | 0.724278 | 0.753971 |
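The hyperparameter combination chosen by the grid search can also be read directly from the fitted search object via best_params_ and best_score_. A self-contained sketch with a tiny grid on synthetic data (not the search above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, make_scorer

# Toy classification problem (illustrative only)
X_toy, y_toy = make_classification(n_samples=300, random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [2, 4, 6]},
    scoring=make_scorer(f1_score),
    cv=5,
).fit(X_toy, y_toy)

# Best parameter combination and its mean cross-validated F1
print(grid.best_params_, grid.best_score_)
```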
# Create the confusion matrix against the test data.
confusion_matrix_sklearn(estimator, X_test, y_test)
# create the new model's performance metrics against the test data.
decision_tree_tune_perf_test = model_performance_classification_sklearn(
estimator, X_test, y_test
)
decision_tree_tune_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.834972 | 0.783362 | 0.727584 | 0.754444 |
# Plot the important features in a bar plot.
feature_names = list(X_train.columns)
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Create a tree visualization graph of the Decision Tree model
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- weights: [1736.39, 132.08] class: 0
|   |   |   |   |   |--- avg_price_per_room > 196.50
|   |   |   |   |   |   |--- weights: [0.75, 25.81] class: 1
|   |   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- weights: [960.27, 223.16] class: 0
|   |   |   |   |   |--- lead_time > 68.50
|   |   |   |   |   |   |--- weights: [129.73, 160.92] class: 1
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- weights: [214.72, 227.72] class: 1
|   |   |   |   |   |--- avg_price_per_room > 93.58
|   |   |   |   |   |   |--- weights: [82.76, 285.41] class: 1
|   |   |   |   |--- lead_time > 117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- weights: [87.23, 81.98] class: 0
|   |   |   |   |   |--- no_of_week_nights > 1.50
|   |   |   |   |   |   |--- weights: [228.14, 48.58] class: 0
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month > 1.50
|   |   |   |   |   |   |--- weights: [363.83, 132.08] class: 0
|   |   |   |   |--- avg_price_per_room > 99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- weights: [219.94, 85.01] class: 0
|   |   |   |   |   |--- lead_time > 3.50
|   |   |   |   |   |   |--- weights: [132.71, 280.85] class: 1
|   |   |   |--- lead_time > 13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- weights: [158.80, 159.40] class: 1
|   |   |   |   |   |--- avg_price_per_room > 71.92
|   |   |   |   |   |   |--- weights: [850.67, 3543.28] class: 1
|   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests > 0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected > 0.50
|   |   |   |   |   |   |--- weights: [15.66, 9.11] class: 0
|   |   |   |   |--- lead_time > 102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- weights: [32.06, 19.74] class: 0
|   |   |   |   |   |--- no_of_week_nights > 2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- weights: [498.03, 44.03] class: 0
|   |   |   |   |   |--- lead_time > 4.50
|   |   |   |   |   |   |--- weights: [258.71, 63.76] class: 0
|   |   |   |   |--- lead_time > 8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [2512.51, 1451.32] class: 0
|   |   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests > 1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights > 3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [180.42, 57.69] class: 0
|   |   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- weights: [184.90, 56.17] class: 0
|   |   |   |   |   |--- arrival_month > 8.50
|   |   |   |   |   |   |--- weights: [106.61, 106.27] class: 0
|   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time > 151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- weights: [3.73, 24.29] class: 1
|   |   |   |   |   |--- lead_time > 163.50
|   |   |   |   |   |   |--- weights: [257.96, 62.24] class: 0
|   |   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |--- avg_price_per_room > 2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults > 1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |   |--- weights: [213.97, 385.60] class: 1
|   |   |   |   |--- avg_price_per_room > 82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- weights: [23.86, 1030.80] class: 1
|   |   |   |   |   |--- no_of_adults > 2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests > 0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- weights: [7.46, 7.59] class: 1
|   |   |   |   |   |--- lead_time > 159.50
|   |   |   |   |   |   |--- weights: [37.28, 4.55] class: 0
|   |   |   |   |--- lead_time > 180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [20.13, 212.54] class: 1
|   |   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- weights: [231.12, 110.82] class: 0
|   |   |   |   |   |--- arrival_month > 11.50
|   |   |   |   |   |   |--- weights: [19.38, 34.92] class: 1
|   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time > 348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room > 100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month > 11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [35.04, 0.00] class: 0
|   |   |   |--- no_of_special_requests > 0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |--- arrival_date > 24.50
|   |   |   |   |   |--- weights: [3.73, 22.77] class: 1
Using the decision rules extracted above, we can make interpretations from the decision tree model such as:
- A booking made more than ~151 days in advance (lead_time > 151.5), with an average room price above ~100, fewer than three special requests, and arrival before December falls into a leaf predicted to be canceled.
- A booking with two or more special requests, a lead time of at most 90 days, and no more than three week nights falls into a pure leaf predicted not to be canceled.
Interpretations from other decision rules can be made similarly.
# Importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)
# Plot the most important features
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The most important features are lead_time, market_segment_type_Online, no_of_special_requests, and avg_price_per_room.

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.
Minimal cost complexity pruning recursively finds the node with the "weakest
link". The weakest link is characterized by an effective alpha, where the
nodes with the smallest effective alpha are pruned first. To get an idea of
what values of ccp_alpha could be appropriate, scikit-learn provides
DecisionTreeClassifier.cost_complexity_pruning_path that returns the
effective alphas and the corresponding total leaf impurities at each step of
the pruning process. As alpha increases, more of the tree is pruned, which
increases the total impurity of its leaves.
# Initialize the Decision Tree Classifier with a random_state=1 for reproducibility
# and class_weight="balanced" to balance the influence of the different classes during training
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Calculate the pruning path for the decision tree classifier using cost complexity pruning.
path = clf.cost_complexity_pruning_path(X_train, y_train)
#Extract the alpha values along with their associated impurities
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
# Convert to a pandas DataFrame and show the first 10 rows to verify.
# Path contains the values of the alpha (the complexity parameter) and the corresponding impurities
# for different pruning levels.
pd.DataFrame(path).head(10)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.008376 |
| 1 | 0.000000 | 0.008376 |
| 2 | 0.000000 | 0.008376 |
| 3 | 0.000000 | 0.008376 |
| 4 | 0.000000 | 0.008376 |
| 5 | 0.000000 | 0.008376 |
| 6 | 0.000000 | 0.008376 |
| 7 | 0.000000 | 0.008376 |
| 8 | 0.000000 | 0.008376 |
| 9 | 0.000000 | 0.008376 |
# Convert to a pandas DataFrame and show the last 10 rows to verify.
# Path contains the values of the alpha (the complexity parameter) and the corresponding impurities
# for different pruning levels.
pd.DataFrame(path).tail(10)
| | ccp_alphas | impurities |
|---|---|---|
| 1834 | 0.002967 | 0.296306 |
| 1835 | 0.003095 | 0.299401 |
| 1836 | 0.003936 | 0.303338 |
| 1837 | 0.004547 | 0.307885 |
| 1838 | 0.005636 | 0.319156 |
| 1839 | 0.008902 | 0.328058 |
| 1840 | 0.009802 | 0.337860 |
| 1841 | 0.012719 | 0.350579 |
| 1842 | 0.034121 | 0.418821 |
| 1843 | 0.081179 | 0.500000 |
# Plot effective alphas vs total impurity of leaves
# Remove the last alpha/impurities node which corresponds to a fully pruned tree.
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
The last value in ccp_alphas is the alpha that prunes the whole tree, leaving the final tree, clfs[-1], with only one node.
# clfs will be used to store the decision tree classifiers trained for different alpha values.
# Initialize to an empty list before we start loading
clfs = []
# Loop through each alpha value and create a Decision Tree Classifier for that particular alpha value
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(
random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
)
#Use the decision tree classifier to train the model
clf.fit(X_train, y_train)
#Store the trained decision tree classifier
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.08117914389136943
For the remainder, we remove the last element in
clfs and ccp_alphas, because it is the trivial tree with only one
node. Here we show that the number of nodes and tree depth decreases as alpha
increases.
# Remove the last element that represents the fully pruned tree (since it doesn't add any value)
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
# Create a node_count list that contains the number of nodes for each classifier
node_counts = [clf.tree_.node_count for clf in clfs]
# Get the max depth value for each classifier and store in the depth list.
depth = [clf.tree_.max_depth for clf in clfs]
# Plot the number of nodes vs alphas
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
# Plot the maximum depth of tree vs alphas
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
# For each decision tree classifier predict the training values and then calculate and store the F1 score.
# f1_train will contain the list of f1_scores for each decision tree classifer trained.
f1_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = f1_score(y_train, pred_train)
f1_train.append(values_train)
# For each decision tree classifier predict the testing values and then calculate and store the F1 score.
# f1_test will contain the list of f1_scores for each decision tree classifer on test data.
f1_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = f1_score(y_test, pred_test)
f1_test.append(values_test)
# Plot the alpha vs F1 Scores for both Training and Test data sets
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# Creating the model where we get highest test f1_score
index_best_model = np.argmax(f1_test)
# Get the best_model from the highest test f1_score.
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00012267633155167002,
class_weight='balanced', random_state=1)
# Create a confusion matrix of the best_model against the training data.
confusion_matrix_sklearn(best_model, X_train, y_train)
# Calculate the best_model's performance metrics against the training data.
decision_tree_post_perf_train = model_performance_classification_sklearn(
best_model, X_train, y_train
)
decision_tree_post_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.899575 | 0.903145 | 0.812762 | 0.855573 |
# Create the confusion matrix for the best_model against the test data.
confusion_matrix_sklearn(best_model, X_test, y_test)
# Calculate the best_model's performance metrics against the test data.
decision_tree_post_perf_test = model_performance_classification_sklearn(
best_model, X_test, y_test
)
decision_tree_post_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.868419 | 0.855764 | 0.765363 | 0.808043 |
# Plot the tree structure of the best_model
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# Draw the arrows
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree of the best_model
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |--- lead_time <= 16.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 68.50
|   |   |   |   |   |   |   |   |   |--- weights: [207.26, 10.63] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room > 68.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 29.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults > 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |   |--- arrival_date > 29.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 7.59] class: 1
|   |   |   |   |   |   |   |--- lead_time > 16.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 135.00
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_previous_bookings_not_canceled <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- no_of_previous_bookings_not_canceled > 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_month > 11.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [21.62, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room > 135.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 12.14] class: 1
|   |   |   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |   |   |--- weights: [1199.59, 0.00] class: 0
|   |   |   |   |   |--- avg_price_per_room > 196.50
|   |   |   |   |   |   |--- weights: [0.75, 25.81] class: 1
|   |   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 63.29
|   |   |   |   |   |   |   |   |--- arrival_date <= 20.50
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [41.75, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected > 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date > 20.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 59.75
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 23.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.49, 12.14] class: 1
|   |   |   |   |   |   |   |   |   |   |--- arrival_date > 23.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [14.91, 1.52] class: 0
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room > 59.75
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 44.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 59.21] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time > 44.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room > 63.29
|   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 3.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 59.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 7.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- arrival_month > 7.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- lead_time > 59.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 5.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- arrival_month > 5.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [20.13, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- no_of_weekend_nights > 3.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.75, 15.18] class: 1
|   |   |   |   |   |   |--- arrival_month > 9.50
|   |   |   |   |   |   |   |--- weights: [413.04, 27.33] class: 0
|   |   |   |   |   |--- lead_time > 68.50
|   |   |   |   |   |   |--- avg_price_per_room <= 99.98
|   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 62.50
|   |   |   |   |   |   |   |   |   |--- weights: [15.66, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room > 62.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 80.38
|   |   |   |   |   |   |   |   |   |   |--- weights: [8.20, 25.81] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room > 80.38
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_month > 3.50
|   |   |   |   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [55.17, 3.04] class: 0
|   |   |   |   |   |   |   |   |--- no_of_week_nights > 2.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 73.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |   |--- lead_time > 73.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [21.62, 4.55] class: 0
|   |   |   |   |   |   |--- avg_price_per_room > 99.98
|   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_year > 2017.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 132.43
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 122.97] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room > 132.43
|   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- avg_price_per_room <= 75.07
|   |   |   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 58.75
|   |   |   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room > 58.75
|   |   |   |   |   |   |   |   |   |--- no_of_previous_bookings_not_canceled <= 1.00
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 4.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 118.41] class: 1
|   |   |   |   |   |   |   |   |   |   |--- arrival_month > 4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |--- no_of_previous_bookings_not_canceled > 1.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 0.00] class: 0
|   |   |   |   |   |   |   |--- no_of_week_nights > 2.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 11.50
|   |   |   |   |   |   |   |   |   |--- weights: [31.31, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date > 11.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 6.07] class: 0
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights > 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.96, 9.11] class: 1
|   |   |   |   |   |   |--- avg_price_per_room > 75.07
|   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [59.64, 3.04] class: 0
|   |   |   |   |   |   |   |--- arrival_month > 3.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 4.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.49, 16.70] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month > 4.50
|   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 86.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 16.70] class: 1
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room > 86.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |   |   |   |   |--- no_of_adults > 1.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 22.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [44.73, 4.55] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date > 22.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |--- avg_price_per_room > 93.58
|   |   |   |   |   |   |--- arrival_date <= 11.50
|   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |--- weights: [16.40, 39.47] class: 1
|   |   |   |   |   |   |   |--- no_of_week_nights > 1.50
|   |   |   |   |   |   |   |   |--- weights: [20.13, 6.07] class: 0
|   |   |   |   |   |   |--- arrival_date > 11.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 102.09
|   |   |   |   |   |   |   |   |--- weights: [5.22, 144.22] class: 1
|   |   |   |   |   |   |   |--- avg_price_per_room > 102.09
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 109.50
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 16.70] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights > 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [33.55, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room > 109.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 124.25
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.98, 75.91] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room > 124.25
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 3.04] class: 0
|   |   |   |   |--- lead_time > 117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- arrival_date <= 7.50
|   |   |   |   |   |   |   |--- weights: [38.02, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_date > 7.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 65.38
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room > 65.38
|   |   |   |   |   |   |   |   |   |--- weights: [24.60, 3.04] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room > 93.58
|   |   |   |   |   |   |   |   |--- arrival_date <= 28.00
|   |   |   |   |   |   |   |   |   |--- weights: [14.91, 72.87] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date > 28.00
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 1.52] class: 0
|   |   |   |   |   |--- no_of_week_nights > 1.50
|   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |--- weights: [84.25, 0.00] class: 0
|   |   |   |   |   |   |--- no_of_adults > 1.50
|   |   |   |   |   |   |   |--- lead_time <= 125.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 90.85
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 87.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [13.42, 13.66] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room > 87.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 15.18] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room > 90.85
|   |   |   |   |   |   |   |   |   |--- weights: [10.44, 0.00] class: 0
|   |   |   |   |   |   |   |--- lead_time > 125.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 19.50
|   |   |   |   |   |   |   |   |   |--- weights: [58.15, 18.22] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date > 19.50
|   |   |   |   |   |   |   |   |   |--- weights: [61.88, 1.52] class: 0
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month > 1.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 70.05
|   |   |   |   |   |   |   |   |   |--- weights: [31.31, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room > 70.05
|   |   |   |   |   |   |   |   |   |--- lead_time <= 5.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [38.77, 1.52] class: 0
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults > 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- lead_time > 5.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
| | | | | | | | |--- arrival_date > 3.50 | | | | | | | | | | | |--- weights: [34.30, 40.99] class: 1 | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | |--- weights: [0.00, 19.74] class: 1 | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | |--- lead_time <= 2.50 | | | | | | | | | | |--- avg_price_per_room <= 74.21 | | | | | | | | | | | |--- weights: [0.75, 3.04] class: 1 | | | | | | | | | | |--- avg_price_per_room > 74.21 | | | | | | | | | | | |--- weights: [9.69, 0.00] class: 0 | | | | | | | | | |--- lead_time > 2.50 | | | | | | | | | | |--- weights: [4.47, 10.63] class: 1 | | | | | | |--- arrival_month > 8.50 | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | |--- weights: [155.07, 6.07] class: 0 | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- weights: [3.73, 10.63] class: 1 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [7.46, 0.00] class: 0 | | | | |--- avg_price_per_room > 99.44 | | | | | |--- lead_time <= 3.50 | | | | | | |--- avg_price_per_room <= 202.67 | | | | | | | |--- no_of_week_nights <= 4.50 | | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | | |--- weights: [63.37, 30.36] class: 0 | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | |--- arrival_date <= 20.50 | | | | | | | | | | |--- weights: [115.56, 12.14] class: 0 | | | | | | | | | |--- arrival_date > 20.50 | | | | | | | | | | |--- arrival_date <= 24.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_date > 24.50 | | | | | | | | | | | |--- weights: [28.33, 3.04] class: 0 | | | | | | | |--- no_of_week_nights > 4.50 | | | | | | | | |--- weights: [0.00, 6.07] class: 1 | | | | | | |--- avg_price_per_room > 202.67 | | | | | | | |--- weights: [0.75, 22.77] class: 1 | | | | | |--- lead_time > 3.50 | | | | | | |--- arrival_month <= 8.50 | | | | | | | |--- avg_price_per_room <= 119.25 | | | 
| | | | | |--- avg_price_per_room <= 118.50 | | | | | | | | | |--- weights: [18.64, 59.21] class: 1 | | | | | | | | |--- avg_price_per_room > 118.50 | | | | | | | | | |--- weights: [8.20, 1.52] class: 0 | | | | | | | |--- avg_price_per_room > 119.25 | | | | | | | | |--- weights: [34.30, 171.55] class: 1 | | | | | | |--- arrival_month > 8.50 | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | |--- weights: [26.09, 1.52] class: 0 | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- arrival_date <= 14.00 | | | | | | | | | | |--- weights: [9.69, 36.43] class: 1 | | | | | | | | | |--- arrival_date > 14.00 | | | | | | | | | | |--- avg_price_per_room <= 208.67 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- avg_price_per_room > 208.67 | | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [15.66, 0.00] class: 0 | | | |--- lead_time > 13.50 | | | | |--- required_car_parking_space <= 0.50 | | | | | |--- avg_price_per_room <= 71.92 | | | | | | |--- avg_price_per_room <= 59.43 | | | | | | | |--- lead_time <= 84.50 | | | | | | | | |--- weights: [50.70, 7.59] class: 0 | | | | | | | |--- lead_time > 84.50 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- arrival_date <= 27.00 | | | | | | | | | | |--- lead_time <= 131.50 | | | | | | | | | | | |--- weights: [0.75, 15.18] class: 1 | | | | | | | | | | |--- lead_time > 131.50 | | | | | | | | | | | |--- weights: [2.24, 0.00] class: 0 | | | | | | | | | |--- arrival_date > 27.00 | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- weights: [10.44, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 59.43 | | | | | | | |--- lead_time <= 25.50 | | | | | | | | |--- weights: [20.88, 6.07] class: 0 | | | | | | | |--- lead_time > 25.50 | | | | | | | | |--- avg_price_per_room <= 71.34 | 
| | | | | | | | |--- arrival_month <= 3.50 | | | | | | | | | | |--- lead_time <= 68.50 | | | | | | | | | | | |--- weights: [15.66, 78.94] class: 1 | | | | | | | | | | |--- lead_time > 68.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- arrival_month > 3.50 | | | | | | | | | | |--- lead_time <= 102.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- lead_time > 102.00 | | | | | | | | | | | |--- weights: [12.67, 3.04] class: 0 | | | | | | | | |--- avg_price_per_room > 71.34 | | | | | | | | | |--- weights: [11.18, 0.00] class: 0 | | | | | |--- avg_price_per_room > 71.92 | | | | | | |--- arrival_year <= 2017.50 | | | | | | | |--- lead_time <= 65.50 | | | | | | | | |--- avg_price_per_room <= 120.45 | | | | | | | | | |--- weights: [79.77, 9.11] class: 0 | | | | | | | | |--- avg_price_per_room > 120.45 | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | |--- weights: [3.73, 12.14] class: 1 | | | | | | | |--- lead_time > 65.50 | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50 | | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | | |--- weights: [16.40, 47.06] class: 1 | | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50 | | | | | | | | | |--- weights: [0.00, 63.76] class: 1 | | | | | | |--- arrival_year > 2017.50 | | | | | | | |--- avg_price_per_room <= 104.31 | | | | | | | | |--- lead_time <= 25.50 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | | | | |--- weights: [16.40, 0.00] class: 0 | | | | | | | | | | |--- arrival_month > 1.50 | | | | | | | | | | | |--- weights: [38.77, 118.41] class: 1 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- weights: [23.11, 0.00] class: 0 | | | | | | | | 
|--- lead_time > 25.50 | | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | | |--- weights: [39.51, 185.21] class: 1 | | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | | | |--- weights: [73.81, 411.41] class: 1 | | | | | | | |--- avg_price_per_room > 104.31 | | | | | | | | |--- arrival_month <= 10.50 | | | | | | | | | |--- room_type_reserved_Room_Type 5 <= 0.50 | | | | | | | | | | |--- avg_price_per_room <= 195.30 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | | | | | | |--- avg_price_per_room > 195.30 | | | | | | | | | | | |--- weights: [0.75, 138.15] class: 1 | | | | | | | | | |--- room_type_reserved_Room_Type 5 > 0.50 | | | | | | | | | | |--- arrival_date <= 22.50 | | | | | | | | | | | |--- weights: [11.18, 6.07] class: 0 | | | | | | | | | | |--- arrival_date > 22.50 | | | | | | | | | | | |--- weights: [0.75, 9.11] class: 1 | | | | | | | | |--- arrival_month > 10.50 | | | | | | | | | |--- avg_price_per_room <= 168.06 | | | | | | | | | | |--- lead_time <= 22.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 22.00 | | | | | | | | | | | |--- weights: [17.15, 83.50] class: 1 | | | | | | | | | |--- avg_price_per_room > 168.06 | | | | | | | | | | |--- weights: [12.67, 6.07] class: 0 | | | | |--- required_car_parking_space > 0.50 | | | | | |--- weights: [48.46, 1.52] class: 0 | |--- no_of_special_requests > 0.50 | | |--- no_of_special_requests <= 1.50 | | | |--- market_segment_type_Online <= 0.50 | | | | |--- lead_time <= 102.50 | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | |--- weights: [697.09, 9.11] class: 0 | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | |--- lead_time <= 63.00 | | | | | | | |--- weights: [15.66, 1.52] class: 0 | | | | | | |--- lead_time > 
63.00 | | | | | | | |--- weights: [0.00, 7.59] class: 1 | | | | |--- lead_time > 102.50 | | | | | |--- no_of_week_nights <= 2.50 | | | | | | |--- arrival_month <= 8.50 | | | | | | | |--- weights: [31.31, 13.66] class: 0 | | | | | | |--- arrival_month > 8.50 | | | | | | | |--- weights: [0.75, 6.07] class: 1 | | | | | |--- no_of_week_nights > 2.50 | | | | | | |--- weights: [44.73, 3.04] class: 0 | | | |--- market_segment_type_Online > 0.50 | | | | |--- lead_time <= 8.50 | | | | | |--- lead_time <= 4.50 | | | | | | |--- no_of_week_nights <= 10.00 | | | | | | | |--- weights: [498.03, 40.99] class: 0 | | | | | | |--- no_of_week_nights > 10.00 | | | | | | | |--- weights: [0.00, 3.04] class: 1 | | | | | |--- lead_time > 4.50 | | | | | | |--- arrival_date <= 13.50 | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | |--- weights: [58.90, 36.43] class: 0 | | | | | | | |--- arrival_month > 9.50 | | | | | | | | |--- weights: [33.55, 1.52] class: 0 | | | | | | |--- arrival_date > 13.50 | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | |--- weights: [123.76, 9.11] class: 0 | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | |--- avg_price_per_room <= 126.33 | | | | | | | | | |--- weights: [32.80, 3.04] class: 0 | | | | | | | | |--- avg_price_per_room > 126.33 | | | | | | | | | |--- weights: [9.69, 13.66] class: 1 | | | | |--- lead_time > 8.50 | | | | | |--- required_car_parking_space <= 0.50 | | | | | | |--- avg_price_per_room <= 118.55 | | | | | | | |--- lead_time <= 61.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | | | |--- weights: [70.08, 0.00] class: 0 | | | | | | | | | |--- arrival_month > 1.50 | | | | | | | | | | |--- no_of_week_nights <= 4.50 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | | | |--- no_of_week_nights > 4.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | 
|--- weights: [126.74, 1.52] class: 0 | | | | | | | |--- lead_time > 61.50 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | | |--- weights: [4.47, 57.69] class: 1 | | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | | |--- lead_time <= 66.50 | | | | | | | | | | | |--- weights: [5.22, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 66.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | |--- avg_price_per_room <= 71.93 | | | | | | | | | | | |--- weights: [54.43, 3.04] class: 0 | | | | | | | | | | |--- avg_price_per_room > 71.93 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | |--- avg_price_per_room > 118.55 | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | |--- arrival_date <= 19.50 | | | | | | | | | |--- no_of_week_nights <= 7.50 | | | | | | | | | | |--- avg_price_per_room <= 177.15 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- avg_price_per_room > 177.15 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- no_of_week_nights > 7.50 | | | | | | | | | | |--- weights: [0.00, 6.07] class: 1 | | | | | | | | |--- arrival_date > 19.50 | | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | | |--- avg_price_per_room <= 121.20 | | | | | | | | | | | |--- weights: [18.64, 6.07] class: 0 | | | | | | | | | | |--- avg_price_per_room > 121.20 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | | |--- lead_time <= 55.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | 
|--- lead_time > 55.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- arrival_month > 8.50 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | |--- weights: [11.93, 10.63] class: 0 | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | |--- weights: [37.28, 0.00] class: 0 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- avg_price_per_room <= 119.20 | | | | | | | | | | | |--- weights: [9.69, 28.84] class: 1 | | | | | | | | | | |--- avg_price_per_room > 119.20 | | | | | | | | | | | |--- truncated branch of depth 12 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- lead_time <= 100.00 | | | | | | | | | | | |--- weights: [49.95, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 100.00 | | | | | | | | | | | |--- weights: [0.75, 18.22] class: 1 | | | | | |--- required_car_parking_space > 0.50 | | | | | | |--- weights: [134.20, 1.52] class: 0 | | |--- no_of_special_requests > 1.50 | | | |--- lead_time <= 90.50 | | | | |--- no_of_week_nights <= 3.50 | | | | | |--- weights: [1585.04, 0.00] class: 0 | | | | |--- no_of_week_nights > 3.50 | | | | | |--- no_of_special_requests <= 2.50 | | | | | | |--- no_of_week_nights <= 9.50 | | | | | | | |--- lead_time <= 6.50 | | | | | | | | |--- weights: [32.06, 0.00] class: 0 | | | | | | | |--- lead_time > 6.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- arrival_date <= 5.50 | | | | | | | | | | |--- weights: [23.11, 1.52] class: 0 | | | | | | | | | |--- arrival_date > 5.50 | | | | | | | | | | |--- avg_price_per_room <= 93.09 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- avg_price_per_room > 93.09 | | | | | | | | | | | |--- weights: [77.54, 27.33] class: 0 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [19.38, 0.00] class: 0 | | | | | | |--- no_of_week_nights > 9.50 | | | | | | 
| |--- weights: [0.00, 3.04] class: 1 | | | | | |--- no_of_special_requests > 2.50 | | | | | | |--- weights: [52.19, 0.00] class: 0 | | | |--- lead_time > 90.50 | | | | |--- no_of_special_requests <= 2.50 | | | | | |--- arrival_month <= 8.50 | | | | | | |--- avg_price_per_room <= 202.95 | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | |--- weights: [1.49, 9.11] class: 1 | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | |--- weights: [8.20, 3.04] class: 0 | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | |--- lead_time <= 150.50 | | | | | | | | | |--- weights: [175.20, 28.84] class: 0 | | | | | | | | |--- lead_time > 150.50 | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | |--- avg_price_per_room > 202.95 | | | | | | | |--- weights: [0.00, 10.63] class: 1 | | | | | |--- arrival_month > 8.50 | | | | | | |--- avg_price_per_room <= 153.15 | | | | | | | |--- room_type_reserved_Room_Type 2 <= 0.50 | | | | | | | | |--- avg_price_per_room <= 71.12 | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 71.12 | | | | | | | | | |--- avg_price_per_room <= 90.42 | | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | | |--- weights: [12.67, 7.59] class: 0 | | | | | | | | | |--- avg_price_per_room > 90.42 | | | | | | | | | | |--- weights: [64.12, 60.72] class: 0 | | | | | | | |--- room_type_reserved_Room_Type 2 > 0.50 | | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 153.15 | | | | | | | |--- weights: [12.67, 3.04] class: 0 | | | | |--- no_of_special_requests > 2.50 | | | | | |--- weights: [67.10, 0.00] class: 0 |--- lead_time > 151.50 | |--- avg_price_per_room <= 100.04 | | |--- no_of_special_requests <= 0.50 | | | |--- no_of_adults <= 1.50 | | | | |--- market_segment_type_Online <= 0.50 | | | | | 
|--- lead_time <= 163.50 | | | | | | |--- lead_time <= 160.50 | | | | | | | |--- weights: [2.98, 0.00] class: 0 | | | | | | |--- lead_time > 160.50 | | | | | | | |--- weights: [0.75, 24.29] class: 1 | | | | | |--- lead_time > 163.50 | | | | | | |--- lead_time <= 341.00 | | | | | | | |--- lead_time <= 173.00 | | | | | | | | |--- arrival_date <= 3.50 | | | | | | | | | |--- weights: [46.97, 9.11] class: 0 | | | | | | | | |--- arrival_date > 3.50 | | | | | | | | | |--- no_of_weekend_nights <= 1.00 | | | | | | | | | | |--- weights: [0.00, 13.66] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 1.00 | | | | | | | | | | |--- weights: [2.24, 0.00] class: 0 | | | | | | | |--- lead_time > 173.00 | | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | | |--- arrival_date <= 7.50 | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | | | |--- arrival_date > 7.50 | | | | | | | | | | |--- weights: [6.71, 0.00] class: 0 | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | |--- weights: [188.62, 7.59] class: 0 | | | | | | |--- lead_time > 341.00 | | | | | | | |--- weights: [13.42, 27.33] class: 1 | | | | |--- market_segment_type_Online > 0.50 | | | | | |--- avg_price_per_room <= 2.50 | | | | | | |--- lead_time <= 285.50 | | | | | | | |--- weights: [8.20, 0.00] class: 0 | | | | | | |--- lead_time > 285.50 | | | | | | | |--- weights: [0.75, 3.04] class: 1 | | | | | |--- avg_price_per_room > 2.50 | | | | | | |--- weights: [0.75, 97.16] class: 1 | | | |--- no_of_adults > 1.50 | | | | |--- avg_price_per_room <= 82.47 | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | |--- weights: [2.98, 282.37] class: 1 | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | |--- arrival_month <= 11.50 | | | | | | | |--- lead_time <= 244.00 | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | |--- lead_time <= 166.50 | | | | | | | | | | | |--- weights: [2.24, 0.00] class: 0 | | | | | | | 
| | | |--- lead_time > 166.50 | | | | | | | | | | | |--- weights: [2.24, 57.69] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | |--- weights: [17.89, 0.00] class: 0 | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | | |--- weights: [11.18, 3.04] class: 0 | | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | | |--- weights: [0.00, 12.14] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | |--- weights: [75.30, 12.14] class: 0 | | | | | | | |--- lead_time > 244.00 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- weights: [25.35, 0.00] class: 0 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- avg_price_per_room <= 80.38 | | | | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | | | | |--- weights: [11.18, 264.15] class: 1 | | | | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- avg_price_per_room > 80.38 | | | | | | | | | | |--- weights: [7.46, 0.00] class: 0 | | | | | | |--- arrival_month > 11.50 | | | | | | | |--- weights: [46.22, 0.00] class: 0 | | | | |--- avg_price_per_room > 82.47 | | | | | |--- no_of_adults <= 2.50 | | | | | | |--- lead_time <= 324.50 | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | | | |--- weights: [7.46, 986.78] class: 1 | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | | |--- weights: [0.00, 10.63] class: 1 | | | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | | | |--- weights: [4.47, 0.00] class: 0 | | | | | | | |--- arrival_month > 11.50 | | | | | | | | |--- market_segment_type_Online <= 0.50 | | | | | | | | | |--- weights: [5.22, 0.00] class: 0 | | | | | | | | |--- 
market_segment_type_Online > 0.50 | | | | | | | | | |--- weights: [0.00, 19.74] class: 1 | | | | | | |--- lead_time > 324.50 | | | | | | | |--- avg_price_per_room <= 89.00 | | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 89.00 | | | | | | | | |--- weights: [0.75, 13.66] class: 1 | | | | | |--- no_of_adults > 2.50 | | | | | | |--- weights: [5.22, 0.00] class: 0 | | |--- no_of_special_requests > 0.50 | | | |--- no_of_weekend_nights <= 0.50 | | | | |--- lead_time <= 180.50 | | | | | |--- lead_time <= 159.50 | | | | | | |--- arrival_month <= 8.50 | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | |--- arrival_month > 8.50 | | | | | | | |--- weights: [1.49, 7.59] class: 1 | | | | | |--- lead_time > 159.50 | | | | | | |--- arrival_date <= 1.50 | | | | | | | |--- weights: [1.49, 3.04] class: 1 | | | | | | |--- arrival_date > 1.50 | | | | | | | |--- weights: [35.79, 1.52] class: 0 | | | | |--- lead_time > 180.50 | | | | | |--- no_of_special_requests <= 2.50 | | | | | | |--- market_segment_type_Online <= 0.50 | | | | | | | |--- avg_price_per_room <= 96.37 | | | | | | | | |--- weights: [12.67, 3.04] class: 0 | | | | | | | |--- avg_price_per_room > 96.37 | | | | | | | | |--- weights: [0.00, 3.04] class: 1 | | | | | | |--- market_segment_type_Online > 0.50 | | | | | | | |--- weights: [7.46, 206.46] class: 1 | | | | | |--- no_of_special_requests > 2.50 | | | | | | |--- weights: [8.95, 0.00] class: 0 | | | |--- no_of_weekend_nights > 0.50 | | | | |--- market_segment_type_Offline <= 0.50 | | | | | |--- arrival_month <= 11.50 | | | | | | |--- avg_price_per_room <= 76.48 | | | | | | | |--- weights: [46.97, 4.55] class: 0 | | | | | | |--- avg_price_per_room > 76.48 | | | | | | | |--- no_of_week_nights <= 6.50 | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | |--- lead_time <= 233.00 | | | | | | | | | | |--- lead_time <= 152.50 | | | | | | | | | | | |--- weights: [1.49, 4.55] class: 1 | | | | | | | | | | |--- lead_time 
> 152.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- lead_time > 233.00 | | | | | | | | | | |--- weights: [23.11, 19.74] class: 0 | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | |--- weights: [2.24, 15.18] class: 1 | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | |--- lead_time <= 269.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- lead_time > 269.00 | | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | |--- no_of_week_nights > 6.50 | | | | | | | | |--- weights: [4.47, 13.66] class: 1 | | | | | |--- arrival_month > 11.50 | | | | | | |--- arrival_date <= 14.50 | | | | | | | |--- weights: [8.20, 3.04] class: 0 | | | | | | |--- arrival_date > 14.50 | | | | | | | |--- weights: [11.18, 31.88] class: 1 | | | | |--- market_segment_type_Offline > 0.50 | | | | | |--- lead_time <= 348.50 | | | | | | |--- weights: [106.61, 3.04] class: 0 | | | | | |--- lead_time > 348.50 | | | | | | |--- weights: [5.96, 4.55] class: 0 | |--- avg_price_per_room > 100.04 | | |--- arrival_month <= 11.50 | | | |--- no_of_special_requests <= 2.50 | | | | |--- weights: [0.00, 3200.19] class: 1 | | | |--- no_of_special_requests > 2.50 | | | | |--- weights: [23.11, 0.00] class: 0 | | |--- arrival_month > 11.50 | | | |--- no_of_special_requests <= 0.50 | | | | |--- weights: [35.04, 0.00] class: 0 | | | |--- no_of_special_requests > 0.50 | | | | |--- arrival_date <= 24.50 | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | |--- arrival_date > 24.50 | | | | | |--- weights: [3.73, 22.77] class: 1
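Rule listings like the one above are produced by sklearn's `export_text`. A minimal sketch on synthetic data (the dataset and feature names here are illustrative stand-ins, not the project's actual variables):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data standing in for the booking features
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
feature_names = [f"feature_{i}" for i in range(4)]

# A shallow tree keeps the printed rule list readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# show_weights=True prints the per-class sample weights at each leaf,
# matching the "weights: [a, b] class: c" lines in the dump above
rules = export_text(tree, feature_names=feature_names, show_weights=True)
print(rules)
```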
# Plot the important features
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
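The bar chart ranks features by the fitted model's `feature_importances_`; the same top-k list can be read off programmatically instead of visually. A sketch on synthetic data (`clf` and the feature names are hypothetical stand-ins for `best_model` and the project's columns):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the booking dataset
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
feature_names = [f"f{i}" for i in range(6)]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Sort importances in descending order and take the top 5
order = np.argsort(clf.feature_importances_)[::-1]
top5 = [(feature_names[i], float(clf.feature_importances_[i])) for i in order[:5]]
print(top5)
```

For a fitted tree, the importances are normalized, so they sum to 1 across all features.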
The most important features are lead_time, market_segment_type_Online, avg_price_per_room, no_of_special_requests, and arrival_month.
# Training performance comparison
models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train_without.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.994211 | 0.831010 | 0.899575 |
| Recall | 0.986608 | 0.786201 | 0.903145 |
| Precision | 0.995776 | 0.724278 | 0.812762 |
| F1 | 0.991171 | 0.753971 | 0.855573 |
# Test performance comparison
models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test_without.T,
        decision_tree_tune_perf_test.T,
        decision_tree_post_perf_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.874299 | 0.834972 | 0.868419 |
| Recall | 0.814026 | 0.783362 | 0.855764 |
| Precision | 0.800838 | 0.727584 | 0.765363 |
| F1 | 0.807378 | 0.754444 | 0.808043 |
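Post-pruning here refers to sklearn's minimal cost-complexity pruning, where candidate `ccp_alpha` values come from the pruning path of the fully grown tree. A minimal sketch on synthetic data (names and the use of plain accuracy for selection are illustrative; the project's own tuning may score candidates differently, e.g. by recall):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the booking dataset
X, y = make_classification(n_samples=400, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Candidate alphas come from the cost-complexity pruning path of the full tree
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_train, y_train)

# Fit one tree per alpha and keep the one that scores best on held-out data
scores = []
for alpha in path.ccp_alphas:
    m = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X_train, y_train)
    scores.append((m.score(X_test, y_test), alpha))
best_score, best_alpha = max(scores)
```

Larger alphas prune more aggressively; the last alpha in the path collapses the tree to its root.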
# Test performance comparison
models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf_threshold_auc_roc.T,
        decision_tree_post_perf_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression-0.37 Threshold",
    "Decision Tree (Post-Pruning)",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Logistic Regression-0.37 Threshold | Decision Tree (Post-Pruning) |
|---|---|---|
| Accuracy | 0.796012 | 0.868419 |
| Recall | 0.739353 | 0.855764 |
| Precision | 0.666667 | 0.765363 |
| F1 | 0.701131 | 0.808043 |
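Comparison frames like the ones above can be built with a small helper that computes the four metrics for each model's predictions. A hedged sketch (the helper name, model names, and toy labels are all illustrative, not the project's actual objects):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def metrics_frame(y_true, preds_by_model):
    """Return a DataFrame with one column per model and one row per metric."""
    rows = {
        name: {
            "Accuracy": accuracy_score(y_true, y_pred),
            "Recall": recall_score(y_true, y_pred),
            "Precision": precision_score(y_true, y_pred),
            "F1": f1_score(y_true, y_pred),
        }
        for name, y_pred in preds_by_model.items()
    }
    return pd.DataFrame(rows)

# Toy labels and predictions for two hypothetical models
y_true = [1, 0, 1, 1, 0, 1]
comp = metrics_frame(
    y_true,
    {"Model A": [1, 0, 1, 0, 0, 1], "Model B": [1, 1, 1, 1, 0, 1]},
)
print(comp)
```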
The features with the greatest influence on booking cancellation are:
- lead_time
- market_segment_type_Online
- avg_price_per_room
- no_of_special_requests
- arrival_month